Phoneme-to-viseme Mapping for Visual Speech Recognition
نویسندگان
چکیده
Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or how many to model on the visual side in audio-visual speech recognition systems. This paper compares the use of 5 viseme maps in a continuous speech recognition task. The focus of the study is visual-only recognition to examine the choice of viseme map. All the maps are based on the phoneme-to-viseme approach, created either using a linguistic method or a data driven method. DCT, PCA and optical flow are used as the visual features. The best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development.
منابع مشابه
Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis
The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to the visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme. In this research it was found that neither the use of standardized no...
متن کاملDecoding visemes: improving machine lipreading (PhD thesis)
This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both areas of speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking or; the parameters of the vid...
متن کاملPrimary research on the viseme system in Standard Chinese
The study of traditional phonetics indicates the shape of lips takes important effect on the articulations of consonants and vowels. [1]. AVSP (Audio-Visual Speech Processing) can improve the naturalness of synthetical speech and recognition rate of the speech recognition system. Especially in computer-synthesized face, the movements of lip-shape play a crucial role. The present research aims t...
متن کاملAutomatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading
Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and is still an open problem. One of the ke...
متن کاملPhoneme - Viseme Mapping for German Video - Realistic Audio - Visual - Speech - Synthesis IKP - Working Paper NF 11
In this working paper we introduce a German viseme set which we already use in our data-driven audio-visual synthesis system. The viseme set is essential for speech driven audio-visual synthesis due to the fact that the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the input text into the phonemic rep...
متن کامل